Abstract

In this paper, we consider the task of learning control policies for text-based games. In these games, all interactions in the virtual world are through text and the underlying state is not observed. The resulting language barrier makes such environments challenging for automatic game players. We employ a deep reinforcement learning framework to jointly learn state representations and action policies using game rewards as feedback. This framework enables us to map text descriptions into vector representations that capture the semantics of the game states. We evaluate our approach on two game worlds, comparing against baselines using bag-of-words and bag-of-bigrams for state representations. Our algorithm outperforms the baselines on both worlds, demonstrating the importance of learning expressive representations.


State 1: The old bridge
You are standing very close to the bridge’s eastern foundation. If you go east you will be back on solid ground ... The bridge sways in the wind.

Command: Go east

State 2: Ruined gatehouse
The old gatehouse is near collapse. Part of its northern wall has already fallen down ...
East of the gatehouse leads out to a small open area surrounded by the remains of the castle. There is also a standing archway offering passage to a path along the old southern inner wall.
Exits: Standing archway, castle corner, Bridge over the abyss

Figure 1: Sample gameplay from a Fantasy World. The player, with the quest of finding a secret tomb, is currently located on an old bridge. She then chooses an action to go east that brings her to a ruined gatehouse (State 2).


1 Introduction 

In this paper, we address the task of learning control policies for text-based strategy games. These games, predecessors to modern graphical ones, still enjoy a large following worldwide.^2 They often involve complex worlds with rich interactions and elaborate textual descriptions of the underlying states (see Figure 1). Players read descriptions of the current world state and respond with natural language commands to take actions. Since the underlying state is not directly observable, the player has to understand the text in order to act, making it challenging for existing AI programs to play these games (DePristo and Zubek, 2001).

* Both authors contributed equally to this work.
^1 Code is available at http://people.csail.mit.edu/karthikn/mud-play
^2 http://mudstats.com/

In designing an autonomous game player, we have considerable latitude when selecting an adequate state representation to use. The simplest method is to use a bag-of-words representation derived from the text description. However, this scheme disregards the ordering of words and the finer nuances of meaning that evolve from composing words into sentences and paragraphs. For instance, in State 2 in Figure 1, the agent has to understand that going east will lead it to the castle whereas moving south will take it to the standing archway. An alternative approach is to convert text descriptions to pre-specified representations using annotated training data, commonly used in language grounding tasks (Matuszek et al., 2013; Kushman et al., 2014).

In contrast, our goal is to learn useful representations in conjunction with control policies. We adopt a reinforcement learning framework and formulate game sequences as Markov Decision Processes. An agent playing the game aims to maximize rewards that it obtains from the game engine upon the occurrence of certain events. The agent learns a policy in the form of an action-value function Q(s, a) which denotes the long-term merit of an action a in state s.

The action-value function is parametrized using a deep recurrent neural network, trained using the game feedback. The network contains two modules. The first one converts textual descriptions into vector representations that act as proxies for states. This component is implemented using Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997). The second module of the network scores the actions given the vector representation computed by the first.

We evaluate our model using two Multi-User Dungeon (MUD) games (Curtis, 1992; Amir and Doyle, 2002). The first game is designed to provide a controlled setup for the task, while the second is a publicly available one and contains human-generated text descriptions with significant language variability. We compare our algorithm against baselines of a random player and models that use bag-of-words or bag-of-bigrams representations for a state. We demonstrate that our model, LSTM-DQN, significantly outperforms the baselines in terms of number of completed quests and accumulated rewards. For instance, on a fantasy MUD game, our model learns to complete 96% of the quests, while the bag-of-words model and a random baseline solve only 82% and 5% of the quests, respectively. Moreover, we show that the acquired representation can be reused across games, speeding up learning and leading to faster convergence of Q-values.

2 Related Work 

Learning control policies from text is gaining increasing interest in the NLP community. Example applications include interpreting help documentation for software (Branavan et al., 2010), navigating with directions (Vogel and Jurafsky, 2010; Kollar et al., 2010; Artzi and Zettlemoyer, 2013; Matuszek et al., 2013; Andreas and Klein, 2015) and playing computer games (Eisenstein et al., 2009; Branavan et al., 2011a).


Games provide a rich domain for grounded language analysis. Prior work has assumed perfect knowledge of the underlying state of the game to learn policies. Gorniak and Roy (2005) developed a game character that can be controlled by spoken instructions adaptable to the game situation. The grounding of commands to actions is learned from a transcript manually annotated with actions and state attributes. Eisenstein et al. (2009) learn game rules by analyzing a collection of game-related documents and precompiled traces of the game. In contrast to the above work, our model combines text interpretation and strategy learning in a single framework. As a result, textual analysis is guided by the received control feedback, and the learned strategy directly builds on the text interpretation.

Our work closely relates to an automatic game player that utilizes text manuals to learn strategies for Civilization (Branavan et al., 2011a). Similar to our approach, text analysis and control strategies are learned jointly using feedback provided by the game simulation. In their setup, states are fully observable, and the model learns a strategy by combining state/action features and features extracted from text. However, in our application, the state representation is not provided, but has to be inferred from a textual description. Therefore, it is not sufficient to extract features from text to supplement a simulation-based player.

Another related line of work consists of automatic video game players that infer state representations directly from raw pixels (Koutník et al., 2013; Mnih et al., 2015). For instance, Mnih et al. (2015) learn control strategies using convolutional neural networks, trained with a variant of Q-learning (Watkins and Dayan, 1992). While both approaches use deep reinforcement learning for training, our work has important differences. In order to handle the sequential nature of text, we use Long Short-Term Memory networks to automatically learn useful representations for arbitrary text descriptions. Additionally, we show that decomposing the network into a representation layer and an action selector is useful for transferring the learnt representations to new game scenarios.

3 Background 

Game Representation We represent a game by the tuple ⟨H, A, T, R, Ψ⟩, where H is the set of all possible game states, A = {(a, o)} is the set of all commands (action-object pairs), T(h' | h, a, o) is the stochastic transition function between states and R(h, a, o) is the reward function. The game state H is hidden from the player, who only receives a varying textual description, produced by a stochastic function Ψ : H → S. Specifically, the underlying state h in the game engine keeps track of attributes such as the player’s location, her health points, time of day, etc. The function Ψ (also part of the game framework) then converts this state into a textual description of the location the player is at or a message indicating low health. We do not assume access to either H or Ψ for our agent during both training and testing phases of our experiments. We denote the space of all possible text descriptions s to be S. Rewards are generated using R and are only given to the player upon completion of in-game quests.


Q-Learning Reinforcement Learning is a commonly used framework for learning control policies in game environments (Silver et al., 2007; Amato and Shani, 2010; Branavan et al., 2011b; Szita, 2012). The game environment can be formulated as a sequence of state transitions (s, a, r, s') of a Markov Decision Process (MDP). The agent takes an action a in state s by consulting a state-action value function Q(s, a), which is a measure of the action’s expected long-term reward. Q-Learning (Watkins and Dayan, 1992) is a model-free technique which is used to learn an optimal Q(s, a) for the agent. Starting from a random Q-function, the agent continuously updates its Q-values by playing the game and obtaining rewards. The iterative updates are derived from the Bellman equation (Sutton and Barto, 1998):

$$Q_{i+1}(s, a) = \mathrm{E}\left[ r + \gamma \max_{a'} Q_i(s', a') \mid s, a \right] \qquad (1)$$

where γ is a discount factor for future rewards and the expectation is over all game transitions that involved the agent taking action a in state s.

Using these evolving Q-values, the agent chooses the action with the highest Q(s, a) to maximize its expected future rewards. In practice, the trade-off between exploration and exploitation can be achieved following an ε-greedy policy (Sutton and Barto, 1998), where the agent performs a random action with probability ε.
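As a concrete illustration of update (1) and the ε-greedy rule, the following minimal Python sketch runs tabular Q-learning on a toy two-state chain. The environment `toy_step`, the learning rate `ALPHA` and all function names are assumptions made for illustration; they are not part of the games studied in this paper. The sampled update used here is the standard stochastic approximation of the expectation in (1).

```python
import random
from collections import defaultdict

GAMMA = 0.5   # discount factor for future rewards
ALPHA = 0.1   # learning rate; an assumption for this toy example
ACTIONS = ["left", "right"]

def toy_step(state, action):
    """Hypothetical 2-state chain: 'right' from state 1 ends the episode with reward 1."""
    if state == 0:
        return (1, 0.0, False) if action == "right" else (0, 0.0, False)
    return (1, 1.0, True) if action == "right" else (0, 0.0, False)

def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])

def q_learning(episodes=500, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # Q-values start at zero
    for _ in range(episodes):
        state = 0
        for _ in range(20):  # cap on episode length
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward, done = toy_step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + GAMMA * best_next
            # Move Q(s, a) a small step towards the sampled Bellman target.
            q[(state, action)] += ALPHA * (target - q[(state, action)])
            state = next_state
            if done:
                break
    return q

q = q_learning()
```

After training, the greedy policy in this toy chain should prefer `right` in both states, since that is the only path to the reward.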


Deep Q-Network In large games, it is often impractical to maintain the Q-value for all possible state-action pairs. One solution to this problem is to approximate Q(s, a) using a parametrized function Q(s, a; θ), which can generalize over states and actions by considering higher-level attributes (Sutton and Barto, 1998; Branavan et al., 2011a). However, creating a good parametrization requires knowledge of the state and action spaces. One way to bypass this feature engineering is to use a Deep Q-Network (DQN) (Mnih et al., 2015). The DQN approximates the Q-value function with a deep neural network to predict Q(s, a) for all possible actions a simultaneously given the current state s. The non-linear function layers of the DQN also enable it to learn better value functions than linear approximators.

Figure 2: Architecture of LSTM-DQN: The Representation Generator (φ_R) (bottom) takes as input a stream of words observed in state s and produces a vector representation v_s, which is fed into the action scorer (φ_A) (top) to produce scores for all actions and argument objects.


4 Learning Representations and Control Policies

In this section, we describe our model (DQN) and its use in learning good Q-value approximations for games with stochastic textual descriptions. We divide our model into two parts. The first module is a representation generator that converts the textual description of the current state into a vector. This vector is then input into the second module, which is an action scorer. Figure 2 shows the overall architecture of our model. We learn the parameters of both the representation generator and the action scorer jointly, using the in-game reward feedback.

Representation Generator (φ_R) The representation generator reads raw text displayed to the agent and converts it to a vector representation v_s. A bag-of-words (BOW) representation is not sufficient to capture higher-order structures of sentences and paragraphs. The need for a better semantic representation of the text is evident from the average performance of this representation in playing MUD-games (as we show in Section 6).

In order to assimilate better representations, we utilize a Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997) as a representation generator. LSTMs are recurrent neural networks with the ability to connect and recognize long-range patterns between words in text. They are more robust than BOW to small variations in word usage and are able to capture underlying semantics of sentences to some extent. In recent work, LSTMs have been used successfully in NLP tasks such as machine translation (Sutskever et al., 2014) and sentiment analysis (Tai et al., 2015) to compose vector representations of sentences from word-level embeddings (Mikolov et al., 2013; Pennington et al., 2014). In our setup, the LSTM network takes in word embeddings w_k from the words in a description s and produces output vectors x_k at each step.

To get the final state representation v_s, we add a mean pooling layer which computes the element-wise mean over the output vectors x_k:^3

$$v_s = \frac{1}{n} \sum_{k=1}^{n} x_k \qquad (2)$$
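The mean pooling step in (2) can be sketched in a few lines of plain Python. The function name `mean_pool` and the example vectors are hypothetical; in the actual model the vectors x_k would be the LSTM outputs for the n words of a description.

```python
def mean_pool(output_vectors):
    """Element-wise mean of a non-empty list of equal-length vectors (Eq. 2)."""
    n = len(output_vectors)
    if n == 0:
        raise ValueError("need at least one output vector")
    dim = len(output_vectors[0])
    return [sum(x[d] for x in output_vectors) / n for d in range(dim)]

# Hypothetical LSTM outputs x_1..x_3 for a three-word description.
outputs = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
v_s = mean_pool(outputs)
print(v_s)  # [2.0, 2.0]
```

Because every output vector contributes equally, the resulting v_s has the same dimensionality regardless of the length of the description.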


Action Scorer (φ_A) The action scorer module produces scores for the set of possible actions given the current state representation. We use a multi-layered neural network for this purpose (see Figure 2). The input to this module is the vector from the representation generator, v_s = φ_R(s), and the outputs are scores for actions a ∈ A. Scores for all actions are predicted simultaneously, which is computationally more efficient than scoring each state-action pair separately. Thus, by combining the representation generator and action scorer, we can obtain the approximation for the Q-function as Q(s, a) ≈ φ_A(φ_R(s))[a].

An additional complexity in playing MUD-games is that the actions taken by the player are multi-word natural language commands such as eat apple or go east. Due to computational constraints, in this work we limit ourselves to commands consisting of one action (e.g. eat) and one argument object (e.g. apple). This assumption holds for the majority of the commands in our worlds, with the exception of one class of commands that require two arguments (e.g. move red-root right, move blue-root up). We consider all possible actions and objects available in the game and predict both for each state using the same network (Figure 2). We consider the Q-value of the entire command (a, o) to be the average of the Q-values of the action a and the object o. For the rest of this section, we only show equations for Q(s, a) but similar ones hold for Q(s, o).

^3 We also experimented with considering just the output vector of the LSTM after processing the last word. Empirically, we find that mean pooling leads to faster learning, so we use it in all our experiments.
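The command scoring rule described above can be sketched as follows. This is an illustrative snippet only: the function name `best_command` and the Q-values are made up, and the real model obtains Q(s, a) and Q(s, o) from the action scorer.

```python
def best_command(q_actions, q_objects):
    """Return the (action, object) pair with the highest averaged Q-value."""
    best, best_q = None, float("-inf")
    for action, qa in q_actions.items():
        for obj, qo in q_objects.items():
            q = (qa + qo) / 2.0  # Q(s, (a, o)) as the average of Q(s, a) and Q(s, o)
            if q > best_q:
                best, best_q = (action, obj), q
    return best, best_q

q_actions = {"eat": 0.6, "go": 0.2}       # hypothetical Q(s, a)
q_objects = {"apple": 0.8, "east": 0.1}   # hypothetical Q(s, o)
command, q_value = best_command(q_actions, q_objects)
```

Since the average decomposes over the two terms, picking the best action and the best object independently yields the same command; the explicit loop is kept only to mirror the definition.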


Parameter Learning We learn the parameters θ_R of the representation generator and θ_A of the action scorer using stochastic gradient descent with RMSprop (Tieleman and Hinton, 2012). The complete training procedure is shown in Algorithm 1. In each iteration i, we update the parameters to reduce the discrepancy between the predicted value of the current state Q(s_t, a_t; θ_i) (where θ_i = [θ_R; θ_A]_i) and the expected Q-value given the reward r_t and the value of the next state max_a Q(s_{t+1}, a; θ_{i-1}).


We keep track of the agent’s previous experiences in a memory D.^4 Instead of performing updates to the Q-value using transitions from the current episode, we sample a random transition (ŝ, â, s', r) from D. Updating the parameters in this way avoids issues due to strong correlation when using transitions of the same episode (Mnih et al., 2015). Using the sampled transition and (1), we obtain the following loss function to minimize:

$$\mathcal{L}_i(\theta_i) = \mathrm{E}_{\hat{s},\hat{a}}\left[ \left( y_i - Q(\hat{s}, \hat{a}; \theta_i) \right)^2 \right] \qquad (3)$$

where $y_i = \mathrm{E}_{\hat{s},\hat{a}}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid \hat{s}, \hat{a} \right]$ is the target Q-value with parameters θ_{i-1} fixed from the previous iteration.

The updates on the parameters θ can be performed using the following gradient of L_i(θ_i):

$$\nabla_{\theta_i} \mathcal{L}_i(\theta_i) = \mathrm{E}_{\hat{s},\hat{a}}\left[ 2 \left( y_i - Q(\hat{s}, \hat{a}; \theta_i) \right) \nabla_{\theta_i} Q(\hat{s}, \hat{a}; \theta_i) \right]$$
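To make the loss and the parameter update concrete, here is a minimal sketch of one stochastic gradient step on a single sampled transition. It assumes a linear Q-function Q(s, a; θ) = θ_a · f(s) over a hypothetical feature vector f(s), purely for readability; LSTM-DQN uses a neural network and RMSprop instead, and the step size `lr` is an arbitrary choice. The step moves θ so that the squared error (y - Q)^2 decreases.

```python
GAMMA = 0.5  # discount factor

def q_value(theta, features, action):
    """Linear Q-function: dot product of the action's weights with the features."""
    return sum(w * f for w, f in zip(theta[action], features))

def td_target(theta_prev, reward, next_features, terminal):
    """y = r if s' is terminal, else r + gamma * max_a' Q(s', a'; theta_prev)."""
    if terminal:
        return reward
    return reward + GAMMA * max(q_value(theta_prev, next_features, a) for a in theta_prev)

def sgd_step(theta, theta_prev, transition, lr=0.1):
    """One descent step on (y - Q(s, a; theta))^2; returns the loss before the step."""
    features, action, reward, next_features, terminal = transition
    y = td_target(theta_prev, reward, next_features, terminal)
    error = y - q_value(theta, features, action)
    # For a linear Q, the gradient of Q w.r.t. theta[action] is the feature vector.
    theta[action] = [w + lr * 2.0 * error * f for w, f in zip(theta[action], features)]
    return error ** 2

theta = {"go east": [0.0, 0.0], "go west": [0.0, 0.0]}
theta_prev = {a: list(w) for a, w in theta.items()}   # parameters fixed from the previous iteration
transition = ([1.0, 0.0], "go east", 1.0, [0.0, 1.0], True)
loss_before = sgd_step(theta, theta_prev, transition)
loss_after = (1.0 - q_value(theta, [1.0, 0.0], "go east")) ** 2
```

In this example the loss on the sampled transition drops from 1.0 to roughly 0.64 after one step.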


For each epoch of training, the agent plays several episodes of the game, which is restarted after every terminal state.

^4 The memory is limited and rewritten in a first-in-first-out (FIFO) fashion.





















Algorithm 1 Training Procedure for DQN with prioritized sampling

Initialize experience memory D
Initialize parameters of representation generator (φ_R) and action scorer (φ_A) randomly
for episode = 1, M do
    Initialize game and get start state description s_1
    for t = 1, T do
        Convert s_t (text) to representation v_{s_t} using φ_R
        if random() < ε then
            Select a random action a_t
        else
            Compute Q(s_t, a) for all actions using φ_A(v_{s_t})
            Select a_t = argmax Q(s_t, a)
        Execute action a_t and observe reward r_t and new state s_{t+1}
        Set priority p_t = 1 if r_t > 0, else p_t = 0
        Store transition (s_t, a_t, r_t, s_{t+1}, p_t) in D
        Sample random mini batch of transitions (s_j, a_j, r_j, s_{j+1}, p_j) from D, with fraction ρ having p_j = 1
        Set y_j = r_j if s_{j+1} is terminal
            y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ) if s_{j+1} is non-terminal
        Perform gradient descent step on the loss L(θ) = (y_j - Q(s_j, a_j; θ))^2


Mini-batch Sampling In practice, online updates to the parameters θ are performed over a mini batch of state transitions, instead of a single transition. This increases the number of experiences used per step and is also more efficient due to optimized matrix operations.

The simplest method to create these mini-batches from the experience memory D is to sample uniformly at random. However, certain experiences are more valuable than others for the agent to learn from. For instance, rare transitions that provide positive rewards can be used more often to learn optimal Q-values faster. In our experiments, we consider such positive-reward transitions to have higher priority and keep track of them in D. We use prioritized sampling (inspired by Moore and Atkeson (1993)) to sample a fraction ρ of transitions from the higher priority pool and a fraction 1 - ρ from the rest.
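A minimal sketch of this two-pool sampling scheme is shown below. It assumes each stored transition is a tuple whose last element is the priority flag (1 for positive reward, 0 otherwise), and it samples without replacement inside each pool; both are choices made for this illustration, and the helper name `sample_minibatch` is hypothetical.

```python
import random

def sample_minibatch(memory, batch_size, rho, rng=random):
    """Draw about rho * batch_size high-priority transitions and fill the rest from the low-priority pool."""
    high = [t for t in memory if t[-1] == 1]
    low = [t for t in memory if t[-1] == 0]
    n_high = min(len(high), int(round(rho * batch_size)))
    n_low = min(len(low), batch_size - n_high)
    return rng.sample(high, n_high) + rng.sample(low, n_low)

# Toy memory of (s, a, r, s_next, priority) tuples: 20 rewarding, 180 non-rewarding.
memory = [("s%d" % i, "a", 1.0, "s_next", 1) for i in range(20)]
memory += [("s%d" % i, "a", -0.01, "s_next", 0) for i in range(20, 200)]

batch = sample_minibatch(memory, batch_size=64, rho=0.25, rng=random.Random(0))
```

With a batch size of 64 and ρ = 0.25, the batch contains 16 high-priority and 48 low-priority transitions whenever both pools are large enough.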


5 Experimental Setup 

Game Environment For our game environment, we modify Evennia,^5 an open-source library for building online textual MUD games. Evennia is a Python-based framework that allows one to easily create new games by writing a batch file describing the environment with details of rooms, objects and actions. The game engine keeps track of the game state internally, presenting textual descriptions to the player and receiving text commands from the player. We conduct experiments on two worlds: a smaller Home world we created ourselves, and a larger, more complex Fantasy world created by Evennia’s developers. The motivation behind Home world is to abstract away high-level planning and focus on the language understanding requirements of the game.

^5 http://www.evennia.com/

Stats                                    Home World       Fantasy World
Vocabulary size                          84               1340
Avg. words / description                 10.5             65.21
Max descriptions / room                  3                100
# diff. quest descriptions               12               -
State transitions                        Deterministic    Stochastic
# states (underlying)                    16               > 56
Branching factor (# commands / state)    40               222

Table 1: Various statistics of the two game worlds

Table 1 provides statistics of the game worlds. We observe that the Fantasy world is moderately sized with a vocabulary of 1340 words and up to 100 different descriptions for a room. These descriptions were created manually by the game developers. These diverse, engaging descriptions are designed to make it interesting and exciting for human players. Several rooms have many alternative descriptions, invoked randomly on each visit by the player.

Comparatively, the Home world is smaller: it has a very restricted vocabulary of 84 words and the room descriptions are relatively structured. However, both the room descriptions (which are also varied and randomly provided to the agent) and the quest descriptions were adversarially created with negation and conjunction of facts to force an agent to actually understand the state in order to play well. Therefore, this domain provides an interesting challenge for language understanding.

In both worlds, the agent receives a positive reward on completing a quest, and negative rewards for getting into bad situations like falling off a bridge, or losing a battle. We also add small deterministic negative rewards for each non-terminating step. This incentivizes the agent to learn policies that solve quests in fewer steps. The supplementary material has details on the reward structure.

Home World We created Home world to mimic the environment of a typical house.^6 The world consists of four rooms: a living room, a bedroom, a kitchen and a garden with connecting pathways. Every room is reachable from every other room. Each room contains a representative object that the agent can interact with. For instance, the kitchen has an apple that the player can eat. Transitions between the rooms are deterministic. At the start of each game episode, the player is placed in a random room and provided with a randomly selected quest. The text provided to the player contains both the description of her current state and that of the quest. Thus, the player can begin in one of 16 different states (4 rooms × 4 quests), which adds to the world’s complexity.

An example of a quest given to the player in text is Not you are sleepy now but you are hungry now. To complete this quest and obtain a reward, the player has to navigate through the house to reach the kitchen and eat the apple (i.e., type in the command eat apple). More importantly, the player should interpret that the quest does not require her to take a nap in the bedroom. We created such misguiding quests to make it hard for agents to succeed without having an adequate level of language understanding.

^6 An illustration is provided in the supplementary material.


Fantasy World The Fantasy world is considerably more complex and involves quests such as navigating through a broken bridge or finding the secret tomb of an ancient hero. This game also has stochastic transitions in addition to varying state descriptions provided to the player. For instance, there is a possibility of the player falling from the bridge if she lingers too long on it.

Due to the large command space in this game,^7 we make use of cues provided by the game itself to narrow down the set of possible objects to consider in each state. For instance, in the MUD example in Figure 1, the game provides a list of possible exits. If the game does not provide such clues for the current state, we consider all objects in the game.

Evaluation We use two metrics for measuring an agent’s performance: (1) the cumulative reward obtained per episode averaged over the episodes and (2) the fraction of quests completed by the agent. The evaluation procedure is as follows. In each epoch, we first train the agent on M episodes of T steps each. At the end of this training, we have a testing phase of running M episodes of the game for T steps. We use M = 50, T = 20 for the Home world and M = 20, T = 250 for the Fantasy world. For all evaluation episodes, we run the agent following an ε-greedy policy with ε = 0.05, which makes the agent choose the best action according to its Q-values 95% of the time. We report the agent’s performance at each epoch.
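The two metrics can be computed from per-episode logs as in the short sketch below. The log format (a list of per-step rewards plus a completion flag for each test episode) and the function name `evaluate` are assumptions made for illustration.

```python
def evaluate(episodes):
    """episodes: list of (rewards, quest_completed) pairs, one per test episode."""
    avg_reward = sum(sum(rewards) for rewards, _ in episodes) / len(episodes)
    completion_rate = sum(1 for _, done in episodes if done) / len(episodes)
    return avg_reward, completion_rate

# Two hypothetical test episodes: one solved in three steps, one unsolved after 20 steps.
logs = [([-0.01, -0.01, 1.0], True), ([-0.01] * 20, False)]
avg_reward, completion_rate = evaluate(logs)
```

Here the first metric averages the cumulative reward of each episode, and the second is the fraction of episodes in which the quest was completed.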

Baselines We compare our LSTM-DQN model with three baselines. The first is a Random agent that chooses both actions and objects uniformly at random from all available choices.^8 The other two are BOW-DQN and BI-DQN, which use a bag-of-words and a bag-of-bigrams representation of the text, respectively, as input to the DQN action scorer. These baselines serve to illustrate the importance of having a good representation layer for the task.

^7 We consider 222 possible command combinations of 6 actions and 37 object arguments.
^8 In the case of the Fantasy world, the object choices are narrowed down using game clues as described earlier.

Figure 3: Left: Graphs showing the evolution of average reward and quest completion rate for BOW-DQN, LSTM-DQN and a Random baseline on the Home world (top) and Fantasy world (bottom). Note that the reward is shown in log scale for the Fantasy world. Right: Graphs showing effects of transfer learning and prioritized sampling on the Home world.

Settings For our DQN models, we used D = 100000, γ = 0.5. We use a learning rate of 0.0005 for RMSprop. We anneal the ε for ε-greedy from 1 to 0.2 over 100000 transitions. A mini-batch gradient update is performed every 4 steps of the gameplay. We roll out the LSTM (over words) for a maximum of 30 steps on the Home world and for 100 steps on the Fantasy world. For the prioritized sampling, we used ρ = 0.25 for both worlds. We employed a mini-batch size of 64 and word embedding size d = 20 in all experiments.
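One way to realize the ε annealing mentioned in the settings is a linear schedule, sketched below. The text only states the start value, end value and number of transitions; the linear shape and the function name `epsilon_at` are assumptions of this sketch.

```python
def epsilon_at(step, start=1.0, end=0.2, anneal_steps=100000):
    """Linearly interpolate epsilon from start to end, then hold it at end."""
    fraction = min(max(step, 0), anneal_steps) / anneal_steps
    return start + fraction * (end - start)

schedule = [epsilon_at(s) for s in (0, 50000, 100000, 250000)]
```

Under this assumed schedule, ε is 1 at the first transition, about 0.6 halfway through, and stays at about 0.2 after 100000 transitions.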

6 Results 

Home World Figure 3 illustrates the performance of LSTM-DQN compared to the baselines. We can observe that the Random baseline performs quite poorly, completing only around 10% of quests on average,^9 obtaining a low reward of around -1.58. The BOW-DQN model performs significantly better and is able to complete around 46% of the quests, with an average reward of 0.20. The improvement in reward is due to both greater quest success rate and a lower rate of issuing invalid commands (e.g. eat apple would be invalid in the bedroom since there is no apple). We notice that both the reward and quest completion graphs of this model are volatile. This is because the model fails to pick out differences between quests like Not you are hungry now but you are sleepy now and Not you are sleepy now but you are hungry now. The BI-DQN model suffers from the same issue although it performs slightly better than BOW-DQN by completing 48% of quests. In contrast, the LSTM-DQN model does not suffer from this issue and is able to complete 100% of the quests after around 50 epochs of training, achieving close to the optimal reward possible.^10 This demonstrates that having an expressive representation for text is crucial to understanding the game states and choosing intelligent actions.

^9 Averaged over the last 10 epochs.

In addition, we also investigated the impact of using a deep neural network for modeling the action scorer φ_A. Figure 4 illustrates the performance of the BOW-DQN and BI-DQN models along with their simpler versions BOW-LIN and BI-LIN, which use a single linear layer for φ_A. It can be seen that the DQN models clearly achieve better performance than their linear counterparts, which points to them modeling the control policy better.

^10 Note that since each step incurs a penalty of -0.01, the best reward (on average) a player can get is around 0.98.

Figure 4: Quest completion rates of DQN vs. Linear models on Home world.

Fantasy World We evaluate all the models on the Fantasy world in the same manner as before and report reward, quest completion rates and Q-values. The quest we evaluate on involves crossing the broken bridge (which takes a minimum of five steps), with the possibility of falling off at random (a 5% chance) when the player is on the bridge. The game has an additional quest of reaching a secret tomb. However, this is a complex quest that requires the player to memorize game events and perform high-level planning, which are beyond the scope of this current work. Therefore, we focus only on the first quest.

From Figure 3 (bottom), we can see that the Random baseline does poorly in terms of both average per-episode reward^11 and quest completion rates. BOW-DQN converges to a much higher average reward of -12.68 and achieves around 82% quest completion. Again, the BOW-DQN is often confused by varying (10 different) descriptions of the portions of the bridge, which reflects in its erratic performance on the quest. The BI-DQN performs very well on quest completion by finishing 97% of quests. However, this model tends to find sub-optimal solutions and gets an average reward of -26.68, even worse than BOW-DQN. One reason for this is the negative rewards the agent obtains after falling off the bridge. The LSTM-DQN model again performs best, achieving an average reward of -11.33 and completing 96% of quests on average. Though this world does not contain descriptions adversarial to BOW-DQN or BI-DQN, the LSTM-DQN obtains higher average reward by completing the quest in fewer steps and showing more resilience to variations in the state descriptions.

^11 Note that the rewards graph is in log scale.

Figure 5: t-SNE visualization of the word embeddings (except stopwords) after training on Home world. The embedding values are initialized randomly.

Transfer Learning We would like the representations learnt by φ_R to be generic enough and transferable to new game worlds. To test this, we created a second Home world with the same rooms, but a completely different map, changing the locations of the rooms and the pathways between them. The main differentiating factor of this world from the original home world lies in the high-level planning required to complete quests.

We initialized the LSTM part of an LSTM-DQN agent with parameters θ_R learnt from the original home world and trained it on the new world.^12 Figure 3 (top right) demonstrates that the agent with transferred parameters is able to learn more quickly than an agent starting from scratch initialized with random parameters (No Transfer), reaching the optimal policy almost 20 epochs earlier. This indicates that these simulated worlds can be used to learn good representations for language that transfer across worlds.

Prioritized sampling We also investigate the effects of different minibatch sampling procedures on the parameter learning. From Figure 3 (bottom right), we observe that using prioritized sampling significantly speeds up learning, with the agent achieving the optimal policy around 50 epochs faster than using uniform sampling. This shows promise for further research into different schemes of assigning priority to transitions.

^12 The parameters for the Action Scorer (θ_A) are initialized randomly.

Description: You are halfways out on the unstable bridge. From the castle you hear a distant howling sound, like that of a large dog or other beast.
Nearest neighbor: The bridge slopes precariously where it extends westwards towards the lowest point - the center point of the hang bridge. You clasp the ropes firmly as the bridge sways and creaks under you.

Description: The ruins opens up to the sky in a small open area, lined by columns. ... To the west is the gatehouse and entrance to the castle, whereas southwards the columns make way for a wide open courtyard.
Nearest neighbor: The old gatehouse is near collapse. East the gatehouse leads out to a small open area surrounded by the remains of the castle. There is also a standing archway offering passage to a path along the old southern inner wall.

Table 2: Sample descriptions from the Fantasy world and their nearest neighbors (NN) according to their vector representations from the LSTM representation generator. The NNs are often descriptions of the same or similar (nearby) states in the game.

Representation Analysis We analyzed the representations learnt by the LSTM-DQN model on the Home world. Figure 5 shows a visualization of learnt word embeddings, reduced to two dimensions using t-SNE (Van der Maaten and Hinton, 2008). All the vectors were initialized randomly before training. We can see that semantically similar words appear close together to form coherent subspaces. In fact, we observe four different subspaces, each for one type of room along with its corresponding object(s) and quest words. For instance, food items like pizza and rooms like kitchen are very close to the word hungry which appears in a quest description. This shows that the agent learns to form meaningful associations between the semantics of the quest and the environment. Table 2 shows some examples of descriptions from Fantasy world and their nearest neighbors using cosine similarity between their corresponding vector representations produced by LSTM-DQN. The model is able to correlate descriptions of the same (or similar) underlying states and project them onto nearby points in the representation subspace.
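The nearest-neighbor lookup behind this analysis can be sketched with plain cosine similarity. The three-dimensional vectors and description ids below are invented for illustration; in the analysis above the vectors would be the representations v_s produced by the representation generator.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbor(query_id, vectors):
    """Return the id of the description most similar to query_id, excluding itself."""
    return max(
        (other for other in vectors if other != query_id),
        key=lambda other: cosine_similarity(vectors[query_id], vectors[other]),
    )

# Hypothetical state representations for three descriptions.
vectors = {
    "bridge_1": [0.9, 0.1, 0.0],
    "bridge_2": [0.8, 0.2, 0.1],
    "gatehouse": [0.0, 0.3, 0.9],
}
neighbor = nearest_neighbor("bridge_1", vectors)
```

In this toy data the two bridge descriptions point in almost the same direction, so each is the other's nearest neighbor.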

7 Conclusions

planning and strategy learning to improve the performance of intelligent agents.